Blog Post 1

LDA
Topic Modelling
Gender bias in newspapers
Topic ideas/Literature Review
Author

Mekhala Kumar

Published

September 18, 2022

library(tidyverse)

Description of the papers

Academic articles

Article 1: Gender Bias in the News: A Scalable Topic Modelling and Visualization Framework

Authors: Prashanth Rao and Maite Taboada

Research question

Rao and Taboada (2021) studied how often men and women are quoted in news articles. They looked at different types of articles, including sports, lifestyle, business and healthcare, to see whether the gender quoted differed across these topics.

Data

The data came from the Gender Gap Tracker, which scrapes data from the websites of seven major English-language Canadian newspapers. The data covered two years starting from October 2018. A total of 612,343 articles were used (Rao & Taboada, 2021).

Methods

To analyse the data, the authors first extracted the people mentioned and the people quoted in each article. The gender of each speaker was determined using a “cache of commonly quoted public figures” and an API that stored gender based on the person’s full or first name. The people quoted were a subset of the people mentioned, and if someone was quoted multiple times within the same article, their name was counted only once. The authors acknowledged that this process limited gender to a binary and could not take other genders into account. For topic modelling, they used Apache Spark’s parallel Latent Dirichlet Allocation in Python. The preprocessing steps included tokenisation, normalisation, lowercasing, stopword removal and lemmatisation. Tokenisation and normalisation removed symbols and artifacts that were not required. Stopword removal removed common words that are not useful for interpretation. Lemmatisation checked whether each token was present in a dictionary of lemmas. They also used two language analyses, keyness and dependency bigrams (Rao & Taboada, 2021).
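The preprocessing steps and the once-per-article counting rule described above can be sketched in a few lines of Python, the language Rao and Taboada report using. This is only a minimal illustration, not the authors' code: the stopword list, the lemma dictionary, the `preprocess` and `count_quoted_by_gender` helpers, and the sample names are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny resources for illustration only; a real study would use
# a full stopword list and a proper lemma dictionary.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were", "on"}
LEMMAS = {"markets": "market", "stocks": "stock", "rose": "rise", "players": "player"}


def preprocess(text):
    """Tokenise, lowercase, drop stopwords and map tokens to lemmas."""
    # Tokenisation + normalisation: keep alphabetic runs, discard symbols and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Dictionary-based lemmatisation: replace a token only if a lemma is known.
    return [LEMMAS.get(t, t) for t in tokens]


def count_quoted_by_gender(articles, gender_lookup):
    """Count quoted people per gender, counting each person once per article."""
    counts = Counter()
    for quoted_names in articles:
        for name in set(quoted_names):  # repeated quotes in one article collapse
            counts[gender_lookup.get(name, "unknown")] += 1
    return counts


tokens = preprocess("The stocks rose on the markets!")
# Made-up names and labels standing in for the name-to-gender cache and API.
articles = [["Ana Roy", "Ana Roy", "Tom Lee"], ["Ana Roy"]]
genders = {"Ana Roy": "female", "Tom Lee": "male"}
quote_counts = count_quoted_by_gender(articles, genders)
```

The cleaned tokens would then be the input to a topic model such as LDA, while the per-gender counts feed the comparison of who is quoted.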

Findings

In general, women were quoted more in the Lifestyle, Entertainment, Arts and Healthcare categories, whereas men were quoted more often in Business, Sports and United States Politics. Men were more likely to be quoted in high-profile articles and women were more likely to be quoted in low-profile articles. Two aspects of the study I found interesting were that the authors looked into the type of information that was reported for men and women, and how the type of news articles reported corresponded to particular events. For men’s sports, there were descriptions of the events, whereas for women’s sports, the focus was on their achievements. In the business section, the male corpus had bigrams such as stocks and trading, but the female corpus had small transactions, shopping and small or local businesses. Finally, information about women’s and trans rights tended to be in the news more in February and March (in both 2019 and 2020). This corresponds to the date of International Women’s Day in March (Rao & Taboada, 2021).

Article 2: Semi-Supervised Topic Modeling for Gender Bias Discovery in English and Swedish

Authors: Hannah Devinney, Jenny Björklund and Henrik Björklund

Research question

Devinney et al. (2020) looked into whether there is gender bias in English and Swedish newspaper articles and in English web content.

Data

There were three corpora in total: mainstream English news articles, mainstream Swedish articles and LGBTQ+ web content in English. Both the mainstream English and Swedish articles were collected from 2019, whereas the web content was collected over five weeks between January and February 2020 (Devinney et al., 2020).

Methods

They used quantitative as well as qualitative methods; I will focus on the quantitative ones. Semi-supervised topic modelling was used, and the topic models were trained with two sets of seed words for fifteen topics. This was done to check for differences in the words and concepts that women, men and nonbinary people are linked to. Parallel semi-supervised Latent Dirichlet Allocation was used for topic inference. Additionally, they trained one baseline, unsupervised topic model for all three corpora in order to find out whether there were any implicitly gendered topics. The preprocessing steps included tokenisation, lemmatisation and POS (Part-of-Speech) tagging in the English corpora (Devinney et al., 2020).

Findings

In the mainstream corpora, the words in the masculine seed lists appeared more often than the words in the neutral seed lists, and twice as often as the words in the feminine seed lists. On the other hand, in the queer web content corpus, there was a better balance of words from all the gender seed lists; there was also more detailed nonbinary representation, as words such as ze, xe, nonbinary, enby and genderqueer were also found. In all three corpora, the feminine topics were linked to the private sphere whereas the masculine topics were linked to the public sphere. There were some variations in the masculine topics among the mainstream English corpus, the mainstream Swedish corpus and the queer web content corpus. In the mainstream English corpus, more neutral terms were used in the articles about the public sphere. Essentially, men were generalised as neutral. On the other hand, in the mainstream Swedish corpus, most of the masculine topics were associated with crime and death or Christianity (Devinney et al., 2020).
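To make the frequency comparison concrete, the sketch below counts how often words from each seed list occur in a small corpus and expresses the masculine count relative to the feminine one. The seed lists, the `seed_frequencies` helper and the toy sentences are invented for illustration; they are not the lists or data used by Devinney et al., and the whitespace tokenisation is a simplifying assumption.

```python
from collections import Counter

# Hypothetical seed lists; the lists used in the paper are not reproduced here.
SEED_LISTS = {
    "feminine": {"she", "her", "woman"},
    "masculine": {"he", "him", "his", "man"},
    "neutral": {"they", "them", "person"},
}


def seed_frequencies(documents, seed_lists):
    """Return the total number of seed-word occurrences per gender list."""
    totals = Counter({label: 0 for label in seed_lists})
    for doc in documents:
        # Simplified tokenisation: lowercase and split on whitespace.
        for token in doc.lower().split():
            for label, seeds in seed_lists.items():
                if token in seeds:
                    totals[label] += 1
    return totals


# Toy documents, made up for the example.
docs = [
    "he said his team won",
    "she thanked him",
    "they met the man",
]
freqs = seed_frequencies(docs, SEED_LISTS)
# How many masculine seed-word occurrences there are per feminine one.
ratio = freqs["masculine"] / freqs["feminine"]
```

A ratio above 1 would indicate that masculine seed words occur more often than feminine ones in the corpus being examined.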

Topic ideas

  1. I found Rao and Taboada’s research on how frequently women and men are quoted interesting, and one research idea I had was to apply similar methods to American or Indian newspapers. Perhaps the main topics researched would be limited to two categories, such as sports and entertainment.

A related idea would be comparing how gender may be discussed differently in two newspapers, such as the New York Times and the Los Angeles Times.

  2. A second idea I had was observing the type of language used in advertisements aimed at different genders in either the United States or India.

  3. A third idea would be looking at Op-Ed pages and observing which gender is featured more, or whether the genders tend to discuss particular topics.

The sources for both would be newspaper articles.

  1. This is an online archive I was able to find that has American newspapers: https://guides.loc.gov/consumer-advertising-great-depression/databases-and-archives

  2. In India, the Times of India and The Hindu are popular newspapers. I was able to find online archives for them as well: https://timesofindia.indiatimes.com/archive.cms and https://www.thehindu.com/archive/

Doubts

  1. In the Indian archives, the articles are not always separated by topic. Therefore, it would be difficult to do an analysis of the discussion of gender based on the newspaper topic.

  2. Some of the archives are not free to use, so I was not sure how to collect data.

  3. Some of the sites can detect bots. If this is the case, will it create issues if we use web scraping?

  4. I am still not sure which text-as-data method I would apply for my research topic ideas.

  5. One of the articles that I read discussed having a pipeline, but I didn’t understand what that meant.

References

  • Rao, P. & Taboada, M. (2021). Gender Bias in the News: A Scalable Topic Modelling and Visualization Framework. Frontiers in Artificial Intelligence, 4, 664737. https://doi.org/10.3389/frai.2021.664737
  • Devinney, H., Björklund, J. & Björklund, H. (2020). Semi-Supervised Topic Modeling for Gender Bias Discovery in English and Swedish. Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, 79–92. https://aclanthology.org/2020.gebnlp-1.8